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Abstract 



Variable selection in high-dimensional space characterizes many contemporary prob- 
lems in scientific discovery and decision making. Many frequently-used t echniques are 
based on independence screening; examples include correlation ranking ( Fan and Lv . 



20081) or feature selectio n using a two-sample t-test in high-dimen sional classification 
(jTibshirani et al . 2003f l. Within the context of the hnear model, Fan and Lv ( 2008 ) 
showed that this simple correlation ranking possesses a sure independence screening 
property under certain conditions and that its revision, called iteratively sure indepen- 
dent screening (ISIS), is needed when the features are marginally unrelated but jointly 
related to the response variable. In this paper, we extend ISIS, without explicit defini- 
tion of residuals, to a general pseudo-likelihood framework, which includes generalized 
linear models as a special case. Even in the least-squares setting, the new method 
improves ISIS by allowing variable deletion in the iterative process. Our technique 
allows us to select important features in high-dimensional classification where the pop- 
ularly used two-sample t-method fails. A new technique is introduced to reduce the 
false discovery rate in the feature screening stage. Several simulated and two real data 
examples are presented to illustrate the methodology. 
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1 Introduction 



The remarkable development of computing power and other technology has allowed 
scientists to collect data of unprecedented size and complexity. Examples include 
data from microarrays, proteomics, brain images, videos, functional data and high- 
frequency financial data. Such a demand from applications presents many new chal- 
lenges as well as opportunities for those in statistics and machine learning, and while 
some significant progress has been made in recent years, there remains a great deal 
to do. 



A very common statistical problem is to model the relationship between one or more 
output variables Y and their associated covariates (or predictors) Xi, . . . , Xp, based on 
a sample of size n. A characteristic feature of many of the modern problems mentioned 
in the previous paragraph is that the dimensionality p is large, potentially much larger 
than n. Mathematically, it makes sense to consider p as a function of n, which diverges 
to infinity. The dimensionality grows very rapidly when interactions of the features 
are considered, which is necessary for many scientific ende avors. For example, in 



disease classificatio n using microarray gene expression data (iTibshirani et al.l . 12003 



Fan and Reru . 120061 ). the number of arrays is usually in the order of tens or hundreds 
while the number of gene expression profiles is in the order of tens of thousands; 
in the study of protein-protein interactions, the sample size may be in the order of 
thousands, but the number of predictors can be in the order of millions. 

The phenomenon of noise accumulation in high- dimensional classificatio n and regres- 



Hastie et al. 



sion h a s long been observed by statisticians and computer scientists (see 
(120011). iFan and Fan! (120081) and references therein) and has been clearly demonstrated 
by iFan and FanI (120081 ). Various feature selection techniques have been proposed. A 
popular family of methods is based on penahzed least-squa res or, more gener ally, pe- 
nalized pseudo-like lihood. Examples incl ude the LASSO ( Tibshiranil . Il996l ). SCAD 
( Fan and Lil . 2001 ). the Dantzig selector ( Candes and Tao . 2007 ). and their related 
methods. These methods have at tracted a great deal of th e oretical study and algo- 
rithm ic development recently. See Donoh o and Elad J 2003 Y Efron et al. J 2004*) , 'Zoii| 
(l2006h.lMeinshausen and Biihlmann (2006il . Izhao and Yul (mod ). IZou and~L (2008,1 , 



Bickel et al.l (120081 ) . and references therein. However, computation inherent in those 
methods makes them hard to apply directly to ultrahigh-dimensional statistical learn- 
ing problems, which involve the simultaneous challenges of computational expediency, 
statistical accuracy, and algorithmic stability. 
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A method that takes up the aforementioned thr ee challenges is th e idea of inde- 
pendent learning, proposed and demonstrated by iFan and Lvl (120081 ) in the regres- 
sion co ntext. The me thod can be derived from an empirical likelihood point of 



(Bair et al.. 


2006 


Paul et al.. 


2008) 



. ■ important, but limited, context of the 

linear model, iFan and Lv mm proposed a two-stage procedure to deal with this 



problem. First, so-called independence screening is used as a fast but crude method 
of reducing the dimensionality to a more moderate size (usually below the sample 
size); then, a more sophisticated technique, such as a penalized likelihood method 
based on the smoothly clipped absolute deviation (SCAD) penalty, can be applied to 
perform the final variable selection and parameter estimation simultaneously. 

Independence screening recruits those predictors having the best marginal utility, 
which corresponds to the largest marginal correlation with the res ponse in the context 
of least-squares regression. Under certain regularity conditions, iFan and Lvl (120081 ) 
show surprisingly that this fast variable selection method has a sure screening prop- 
erty; that is, with probability very close to 1, the independence screening technique 
retains all of the important variables in the model. As a result of this theoreti- 
cal justification, the method is referred to as Sure Independence Screening (SIS). 
An important methodological extension, called Iterated Sure Independence Screening 
(ISIS), covers cases where the regularity conditions may fail, for instance if a predic- 
tor is marginally uncorrelated, but jointly correlated with the response, or the reverse 
situation where a predictor is jointly uncorrelated but has higher marginal correlation 
than some important predictors. Roughly, ISIS works by iteratively performing vari- 
able selection to recruit a small number of predictors, computing residuals based on 
the model fitted using these recruited variables, and then using the working residuals 
as the response variable to continue recruiting new predictors. The crucial step is to 
compute the working residuals, which is easy for the least-squares regression problem 
but not obvio us for other proble ms. The improved performance of ISIS has been 
documented in lFan and Lvl (120081 ). 



Independence screening is a commonly used techniques for feature selection. It has 
been widely used for gene selection or disease classification in bioinformatics. In 
those applications, the genes or proteins are called statistically significant if their 
associated expressions in the treatment group differ statistically from the control 
group, resulting i n a large and activ e lite rature on th e multiple testing problem. 
See, for example, iDudoit et al.l (120031 ) and lEfronl (120081 ). The sele cted features are 



frequently used for tumor/disease classification. See, for example, iTibshirani et al. 
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(120031 ) , and iFan and Reru (120061 ) . This scree ninR method i s inde ed a form of inde- 
pendence screening and has been justified by iFan and Fan] (120081 ) under some ideal 
situations. However, common sense can carry us only so far. As indicated above and 
illustrated further in Section 4.1, it is easy to construct features that are marginally 
unrelated, but jointly related with the response. Such features will be screened out by 
independent learning methods such as the two-sample t test. In other words, genes 
that are screened out by test statistics can indeed be important in disease classifica- 
tion and understanding molecular mechanisms of the disease. How can we construct 
better feature selection procedures in ultrahigh dimensional feature space than the 
independence screening popularly used in feature selection? 

The first goal of this paper is to extend SIS and ISIS to much more general models. 
One challenge here is to make an appropriate definition of a residual in this context. 
We describe a procedure that effectively sidesteps this issue and therefore permits 
the desired e xtens ion of ISIS. In fact, our method even improves the original ISIS of 
Fan and Lvl (120081 ) in that it allows variable deletion in the recruiting process. Our 
methodology applies to a very general pseudo-likelihood framework, in which the aim 
is to find the parameter vector f3 = . . . ,/5p)^ that is sparse and minimizes an 
objective function of the form 



where (x^, Yj) are the covariate vector and response for the z*'' individual. Important 
applications of this methodology, which is outlined in greater detail in Section [21 
include the following: 



1. 



2. 



Generalized linear models: All generalized linear models, including logistic 
regression and Poisson l og-linear models, fit very natu rally into our method- 
ological framework. See iMcCuUagh and Nelderl (119891 ) for many applications 
of generalized linear models. Note in particular that logistic regression models 
yield a popular approach for studying classification problems. In Section HI we 
present simulations in wh ich our approach compares favorably with the com- 
peting LASSO technique (ITibshiranil . Il996l ). 



Classification: Other common approaches to classification assume the re- 
sponse takes values in { — 1, 1} and also fit into our framework. For instance, 
support vector machine classifiers use the hinge loss function L{Yi, ^f(3) = (1 — 
YiX.f(3)^, while the boosting algorithm AdaBoost uses L(Yi, ^J(3) = exp(— Fjxf /3). 
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3. Robust fitting: In a high-dimensional hnear model setting, it is advisable 
to be cautious about the assumed relationship between the predictors and the 
response. Thus, instead of the conventional least squares loss function, we may 
prefer a rob ust loss funct ion such as the Li loss L(Fj,xf/3) = \Yi — xf/3| or the 
Huber loss (IHuberl . Il964j ). which also fits into our framework. 



Any screening method, by default, has a large false discovery rate (FDR), namely, 
many unimportant features are selected after screening. A second aim of this paper, 
covered in Section [3l is to present two variants of the SIS methodology, which re- 
duce significantly the FDR. Both are based on partitioning the data into (usually) 
two groups. The first has the desirable property that in high-dimensional problems 
the probability of incorrectly selecting unimportant variables is small. Thus this 
method is particularly useful as a means of quickly identifying variables that should 
be included in the final model. The second method is less aggressive, and for the 
linear model has the same sure screening property as the original SIS technique. The 
applications of our proposed methods are illustrated in Section 5. 



2 ISIS methodology in a general framework 



Let y = (Fi, . . . , y„) be a vector of responses and let xi, . . . , x„ be their associated co- 
variate (column) vectors, each taking values in W. The vectors (xf , Yi), . . . , (x^, F„) 
are assumed to be independent and identically distributed realizations from the pop- 
ulation [Xi, . . . , Xp, Y)^. The n X p design matrix is X = (xi, . . . , x„)^. 



2.1 Feature ranking by marginal utilities 

The relationship between Y and {Xi, . . . , Xp)'^ is often modeled through a parameter 
vector f3 = {/3i, . . . , (3p)'^ , and the fitting of the model amounts to minimizing a 
negative pseudo-likelihood function of the form 

n 

Q{(3o, /3) = n'' Yl ^(^- ^0 + 

i=l 
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Here, L can be regarded as the loss of using [3q + to predict Fj. The marginal 
utility of the j-feature is 



mmn 

/3o,/3i 



(2) 



which minimizes the loss function, where Xj 



(Xji, . . . , Xjp)-^. The idea of SIS in this 
framework is to compute the vector of marginal utilities L = (Li, . . . , Lp)"^ and rank 
them according to the marginal utilities: the smaller the more important. Note that 
in order to compute Lj, we need only fit a model with two parameters, /3o and (3j, so 
computing the vector L can be done very quickly and stably, even for an ultrahigh 
dimensional problem. The variable Xj is selected by SIS if Lj is one of the d smallest 
components of L. Typically, we may take d = \n/ lognj. 

The procedure above is an independence screening method. It utilizes only a marginal 
relation between predictors and the response variable to screen variables. When d is 
large enough, it possesses the sure screening property. For this reason, we call the 
method Sure Independenc e Screening (SIS). For classification problems with quadratic 



loss L, iFan and Lvl (120081 ) shows that SIS r educes to feature screening using a two 



sample t-statistic. See also iHall et al.l (120081 ) for a derivation from an empirical like- 
lihood point of view. 



2.2 Penalized pseudo- likelihood 

With variables crudely selected by SIS, variable selection and parameter estimation 
can further be carried out simultaneously using a more refined penalized (pseudo)- 
likelihood method, as we now describe. The approach takes joint information into 
consideration. By reordering the variables if necessary, we may assume without 
loss of generality that Xi, . . . ,Xd are the variables recruited by SIS. We let Xj^^ = 
(Xji, . . . , Xid)^ and redefine (3 = {Pi, . . . , Pd)'^- In the penalized likelihood approach, 
we seek to minimize 

n d 

£(/?o,/3)=n-i5^L(y„/5o + xJ,/3) + 5^p,(|/5,|). (3) 

Here, p\{-) is a penalty function and A > is a regularization parameter, which may be 
chosen by five-fold cross-validation, for example. The penalty function should satisfy 
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certain conditions in order for the resulting estimates to have desirable properties, 
and in particul ar to yield sparse solutions in which some of the coefficients may be 
set to zero; see iFan and Lil (120011 ) for further details. 



Commonly used examples of penalty func tions include the Li penalty Pa(|/5|) = MP\ 
( Tibshirani . 1996 : Park and Hastid . 2007 ). the smoothly clipped absolute deviation 
(SCAD) penalty (IFan and Lil . 1200 ll ). which is a quadratic spline with p\{0) = and 



Pa(I/5|) = A 1{|^I<A} + 



(aX 



1)A 



-1 



{|/3|>A} h 



for some a > 2 and > 0, and the minimum concavity penalty (MCP ), p\(\P\) = 
(A — \P \ /a)+ (IZhangj . 120071 ). The choice a = 3.7 has been recommended in lFan and Li 
(I2OOII ). Unlike the Li penalty, SCAD and MC penalty func tions have flat tails 



which are fundarnental i n reducing biases due to penalization (lAntoniadis and Fan 



2001; 


Fan and Li. 


2001 


)• 



minimizing the objective function for the Li penalty, and IZhang) (120071 ) propose a 
PLUS algorithm for flnding solution paths to the penalized le a st-squ ares problem 
with a general penalty px{-). On the other hand, iFan and Lil (120011 ) have shown 



that the SCAD-type of penalized loss f unction can be mi nimized iteratively using a 



local quadratic approximation, whereas IZou and Lil (120081 ) propose a local linear ap 



proximation, taking the advanta ge of recently developed algorithms for penalized Li 
optimization ( Efron et al.l . 120041 ) . Starting from /3''°-* = as suggested by lFan and Lv 



(I2OO8I ). using the local linear approximation 

Pxm)^pxi\p^''^\)+p'xi\p^'^\m-\p^'^\), 

in ([3]), at the {k + 1)*^ iteration we minimize the weighted Li penalty 



n 



(4) 



1=1 



where w^''^ = p'^{\l3j'''\). Note that with initial value f3 



(0) 



0, /3(^) is indeed a LASSO 
estimate for the SCAD and MC penalty, since p'x{0+) = A. In other words, zero is 
not an absorbing state. Though motiva ted slightly differently, a weighted Li penalty 



20061 ): in that case Wj'''^ = Wj 



1/1/5,1 



is also the basis of the adaptive Lasso (iZou . 

where f3 = {(3i, . . . , f3d)'^ may be taken to be the maximum likelihood estimator, and 
7 > is chosen by the user. The drawback of such an approach is that zero is an 
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absorbing state when (jlj) is iteratively used — components being estimated as zero 
at one iteration will never escape from zero. 



For a class of penalty furictions that includes the SCAD penalty and when p is fixed 
as n diverges, iFan and Lil (120011 ) established an oracle property; that is, the penalized 
estimates perform asymptotically as well as if an oracle had told us in advance which 
components of (3 were non-zero. Fan and Pend (2004) e xtended thi s result to cover 
situations where d may diverge with d = dn = o{n^^^). IZoul (120061 ) shows that the 
adaptive LASSO poss esses the oracle property too, when i s finit e. See also further 
theoretical studies by IZhang and Huangl (120081 ) and IZhangl (120071 ). We refer to the 
two-stage procedures described above as SIS-Lasso, SIS-SCAD and SIS-AdaLasso. 



2.3 Iterative feature selection 

The SIS methodology may break down if a predictor is marginally unrelated, but 
jointly related with the response, or if a predictor is jointly uncorrelated with the 
response but has higher marginal correlation with the response than some important 
predictors. In the former case, the important feature has already been screened at the 
first stage, whereas in the latter case, the unimportant feature is ranked too high by 
the independent screening technique. ISIS seeks to overcome these difficulties by using 
more fully the joint covariate information while retaining computational expedience 
and stability as in SIS. 

In the first step, we apply SIS to pick a set Ai of indices of size ki, and then employ 
a penalized (pseudo)-likelihood method such as Lasso, SCAD, MCP or the adaptive 
Lasso to select a subset A4i of these indices. This is our initial estimate of the set of 
indices of important variables. The screening stage solves only bivariate optimizations 
([2]) and the fitting part solves only a optimization problem ([3]) with moderate size ki. 
This is an attractive feature in ultrahigh dimensional statistical learning. 

Instead of computing residuals, as could be done in the linear model, we compute 

n 

Lf = min n-' ^(^- + <M,(^M. + ^u/?.), (5) 
^ 1=1 

for j e J^AI = {!,..., p}\A4i, where ^i^Mi is the sub- vector of Xj consisting of those el- 
ements in All. This is again a low- dimensional optimization problem which can easily 
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be solved. Note that lJ''^ [after subtracting the constant min/3Q^/3^^ n ^ Yll=i L{Yi, /3o+ 
^Imi^^Mi) changing the sign of the difference] can be interpreted as the additional 
contribution of variable Xj given the existence of variables in A^i. After ordering 
{Lj : j E J^l}, we form the set A2 consisting of the indices corresponding to the 
smallest k2 elements, say. In this screening stage, an alternative approach is to substi- 
tute the fitted value (3j^_^ from the first stage into and the optimizat ion problem 
(O wo uld only be bivariate. This approach is exactly an extension of iFan and Lv 
(I2OO8I ) as we have 



L{Yi, Po + + Xij(3j) — {vi — (3q — XijfijY, 

for the quadratic loss, where fj = Fj — ^J^Mif^Mi residual from the previous 

step of fitting. The conditional contributions of features are more relevant in re- 
cruiting variables at the second stage, but the computation is more expensive. Our 
numerical experirn ents in Section K4l shows the improvement of such a deviation from 
Fan and Lvl (120081 ). 



After the prescreening step, we use penalized likelihood to obtain 



32= argmin n ^ ^ /5o + 



(3. 



Again, the penalty term encourages a sparse solution. The indices of (^2 that are 
non-zero yield a new estimated s et of ac t ive in dices. This step also deviates 
importantly from the approach in iFan and Lvl (120081 ) even in the least-squares case. 
It allows the procedure to delete variables from the previously selected variables with 
indices in A^i. 

The process, which iteratively recruits and deletes variables, can then be repeated 
until we obtain a set of indices M.^ which either has reached the prescribed size d, or 
satisfies M.i = Aie-i. In our implementation, we chose ki = \_2d/3\, and thereafter 
at the rth iteration, we took kr = d — \ Air-i\- This ensures that the iterated versions 
of SIS take at least two iterations to terminate; another possibility would be to take, 
for example, k^ = mm{5,d — Of course, we also obtain a final estimated 

parameter vector 13^. The ab ove method can be considered as an analogue of the 
least squares ISIS procedure (IFan and Lvl . l2008l ) without explicit definition of the 
residuals. In fact, it is an improvement even for the least-squares problem. 



Fan and Lvl (120081 ) showed empirically that for the linear model ISIS improves signif- 
icantly the performance of SIS in the difficult cases described above. The reason is 
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that the fitting of the residuals from the (r — 1)*'' iteration on the remaining predic- 
tors significantly weakens the priority of those unimportant variables that are highly 
correlated with the response through their associations with {Xj : j G A4r-i}- This 
is due to the fact that the variables {Xj : j G Air-i} have lower correlation with the 
residuals than with the original responses. It also gives those important predictors 
that are missed in the previous step a chance to survive. 

2.4 Generalized linear models 

Recall that we say that Y is of exponential dispersion family form if its density can 
be written in terms of its mean n and a dispersion parameter as 

/y (y; /i, 0) = exp I ^^^^^^^^^ + c(i/, 0) I , 

from some known functions ^(■), and c(-,-). In a generalized linear model for 
independent responses Yi, . . . , y„, we assert that the conditional density oiYi given the 
covariate vector Xj = Xj is of exponential dispersion family form, with the conditional 
mean response [li related to Xj through gi^^ii) = ^Jf3 for some known link function g{-), 
and where the dispersion parameters are constrained by requiring that (pi = (pai, for 
some unknown dispersion parameter and known constants ai, . . . , a„. For simplicity, 
throughout the paper, we take a constant dispersion parameter. 

It is immediate from the form of the likelihood function for a generalized linear model 
that such a model fits within the pseudo-likelihood framework of Section |H In fact, 
we have in general that 

n 
i=l 

If we make the canonical choice of link function, g{-) = 6{-), then ([7]) simplifies to 

n 

m,^fl3) = 5^{6(xf/3) - K,xf/3}. 

1=1 

An elegant way to handle classification problems is to assume the class label takes 
values or 1, and fit a logistic regression model. For this particular generalized linear 
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model, we have 

n 

m,^Jf3) = J]{log(l + e^"/3) - 

i=l 

while for Poisson log-linear models, we may take 

n 
i=l 

3 Reduction of false discovery rates 



Sure independence screening approaches are simple and quick methods to screen out 
irrelevant variables. They are usually conservative and include many unimportant 
variables. In this section, we outline two possible variants of SIS and ISIS that have 
attractive theoretical properties in terms of reducing the FDRs. The first is an ag- 
gressive variable selection method that is particularly useful when the dimensionality 
is very large relative to the sample size; the second is a more conservative procedure. 



3.1 First variant of ISIS 

It is convenient to introduce some new notation. We write A for the set of active 
indices - that is, the set containing those indices j for which jSj 7^ in the true model. 
Write X_4 = {Xj : j G A} and Xy^c = {Xj : j G A^} for the corresponding sets of 
active and inactive variables respectively. 

Assume for simplicity that n is even, and split the sample into two halves at random. 
Apply SIS or ISIS separately to the data in each partition (with d = [n/\ogn\ or 
larger, say), yielding two estimates A^^^ and A^"^^ of the set of active indices A. Both of 
them should have large FDRs, as they are constructed fror n a crude scr e ening method. 



Assume that both sets have the sure screening property (jPan and Lvl . |2008| ): 



P{A C A'^^'>) -> 1, for j = 1 and 2. 

Then, the active variables should appear in both sets with probability tending to one. 
We thus construct A = A^^^ fl A^'^^ as an estimate of A. This estimate also possesses 
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a sure screening property: 

However, this estimate contains many fewer indices corresponding to inactive vari- 
ables, as such indices have to appear twice at random in the sets A^^'' and A^'^^ . This 
is indeed shown in Theorem [1] below. 

Just as in the original formulation of SIS in Section [21 we can now use a penalized 
(pseudo)-likelihood method such as SCAD to perform final variable selection from A 
and parameter estimation. We can even proceed without the penalization since the 
false discovery rate is small. 

In our theoretical support for this variant of SIS, we will make use of the following 
condition: 

(Al) Let r G N, the set of natural numbers. We say the model satisfies the ex- 
changeability condition at level r if the set of random vectors 

{(Y, X^, Xj-^, . . . , Xj^) : ji, . . . ,jr are distinct elements of A^} 

is exchangeable. 

This condition ensures that each inactive variable is equally likely to be recruited by 
SIS. In Theorem [1] below, the case r = 1 is particularly important, as it gives an upper 
bound on the probability of recruiting any inactive variables into the model. Note that 
this upper bound requires only the weakest version (level 1) of the exchangeability 
condition. 

Theorem 1. Let r G N, and assume the model satisfies the exchangeability condition 
(Al) at level r. If A denotes the estimator of A from the above variant of SIS, then 

Pd^n^^l > r) < < -( --) , 

^' I - ^ - (p-\^\^ - r\\p- \A\J 

where, for the second inequality, we require d^ < p — 1^| and d is the prescribed number 
of selected variables in A^^^ or AS^\ 
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Proof. Fix r G M, and let J' = {{ji, ■ ■ ■ ,jr) '■ ji, ■ ■ ■ ,jr are distinct elements of A'^}. 
Then 

(il,-Jr)Gj 

in which we use the random splitting in the last equality. Obviously, the last proba- 
bility is bounded by 

max P(jiG^«,--- ,>g1«) V P(ji G ■ ■ ■ , jV G (8) 

Oi,-jr)e^ 

Since there are at most d inactive variables from A'^ in the set A.^^\ the number of 
r-tuples from falling in the set A^^^ can not be more than the total number of such 
r-tuples in A^^\ i.e. 



Thus, we have 

P(jiel«,---,jVGl«)<n. (9) 

Substituting this into ([H]), we obtain 



Pi\AnA'\>r) < ('^] max P(ji G ■ ■ ■ G 

\rj {ji,-,jr)ej 

Now, under the exchangeability condition (Al), each r-tuple of distinct indices in A"^ 
is equally likely to be recruited into A^^\ Hence, it follows from Q) that 

max P(ji G A''^\ ■ ■ ■ , JV G ^(^)) < ^'^^ 



and the first result follows. The second result follows from the simple fact that 

id-iV d^ 

< — , for all < z < rf, 

p* — i p* 

13 



where p* = p — \A\, and the simple calculation that 



0' _ 1 d\d - 1)2 . . . (rf - r + 1)2 ^ 1 fd 



{^*) p*{p* — 1) ■ ■ ■ (p* — r + 1) r! \p* 
This completes the proof. □ 



Theorem 1 gives a nonasymptotic bound, using only the symmetry arguments. From 
Theorem [1], we see that if the exchangeability condition at level 1 is satisfied and if 
p is large by comparison with n^, then when the number of selected variables d < n, 
we have with high probability this variant of SIS reports no 'false positives'; that is, 
it is very likely that any index in the estimated active set also belongs to the active 
set in the true model. The nature of this result is a little unusual in that it suggests a 
'blessing of dimensionality' - the bound on the probability of false positives decreases 
with p. However, this is only part of the full story, because the probability of missing 
elements of the true active set is expected to increase with p. 

Of course, it is possible to partition the data into K > 2 groups, say, each of size n/ K, 
and estimate A by A^^^ CiA^'^^ fl . . . riA^^\ where A^''^ represents the estimated set of 
active indices from the kth partition. Such a variable selection procedure would be 
even more aggressive than the K = 2 version; improved bounds in Theorem [1] could 
be obtained, but the probability of missing true active indices would be increased. 
As the K = 2 procedure is already quite aggressive, we consider this to be the most 
natural choice in practice. 

In the iterated version of this first variant of SIS, we apply SIS to each partition 
separately to obtain two sets of indices A^i'^ and A^i \ each having ki elements. After 
forming the intersection Ai = ^^^^fl^^'', we carry out penalized likelihood estimation 
as before to give a first approximation A^i to the true active set of variables. We 
then perform a second stage of the ISIS procedure, as outlined in Section [2l to each 
partition separately to obtain sets of indices A^i U A^2^ and A^i U A^2^. Taking the 
intersection of these sets and re-estimating parameters using penalized likelihood as 
in Section [2] gives a second approximation to the true active set. This process 
can be continued until we reach an iteration i with Aii = Aie~i, or we have recruited 
d indices. 
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3.2 Second variant of ISIS 



Our second variant of SIS is a more conservative variable selection procedure and 
also relies on random partitioning the data into K = 2 groups as before. Again, we 
apply SIS to each partition separately, but now we recruit as many variables into 
equal-sized sets of active indices A^^^ and A^'^^ as are required to ensure that the 
intersection A = A^^^ fl A^'^^ has d elements. We then apply a penalized pseudo- 
likelihood method to the variables = {Xj : j G A} for final variable selection and 
parameter estimation. 

Theoretical support for this method can be provided in the case of the linear model; 
namely, under certain regularity conditions, this variant o f SIS posses s es the sure 



screening property. More precisely, if Conditions (l)-(4) of iFan and Lvl (120081 ) hold 



with 2k + T < 1, and we choose d = \n/ lognj , then there exists C > such that 

P{A<^A) = 1 - Oiexpi-Cn^-^" / \ogn + \ogp)}. 

The parameter k > controls the rate at which the minimum signal min^gyi \ f3j\ is 
allowed to converge to zero, while r > controls the rate at which the maximal 
eigenvalue of the covariance matrix S = Cov(Xi, . . . , Xp) is allowed to diverge to 
infinity. In fact, we insist that minjg_4 > n^" and Amax(S) < n'^ for large n, where 
Amax(S) denotes the maximal eigenvalue of S. Thus, these technical conditions ensure 
that any non-zero signal is not too small, and that the predictors are not too close to 
being collinear, and the dimensionality is also controlled via log p = o(?7,^~^''/ logn). 



which is still of an exponential order. See iFan and Lvl (120081 ) for further discussion 
of the sure screening property. 

An iterated version of this algorithm is also available. At the first stage we apply SIS, 
taking enough variables in equal-sized sets of active indices A^^^ and A^^^ to ensure 
that the intersection Ai = A^i^ fl A^i^ has ki elements. Applying penalized likelihood 
to the variables with indices in Ai gives a first approximation A^i to the true set of 
active indices. We then carry out a second stage of the ISIS procedure^ of Section [2] 
to each partition separately to obtain equal-sized new sets of indices ^2^'* and A'^\ 
taking enough variables to ensure that A2 = A'^^ fl A^^ has k2 elements. Penalized 
likelihood applied to TWi fl ^2 gives a second approximation to the true set of 
active indices. As with the first variant, we continue until we reach an iteration i 
with = A4i-i, or we have recruited d indices. 



15 



4 Numerical results 



We illustrate the breadth of applicability of (I) SIS and its variants by studying its 
performance on simulated data in four different contexts: logistic regression, Poisson 
regression, robust regression (with a least absolute deviation criterion) and multi- 
class classification with support vector machines. We will consider three different 
configurations of the p = 1000 predictor variables Xi, . . . , Xf 



Case 1: Xi, . . . , Xp are independent and identically distributed A^(0, 1) random vari- 
ables 

Case 2: Xi, . . . , Xp are jointly Gaussian, marginally X(0, 1), and with corr(Xj, X4) = 
1/a/2 for alH 7^ 4 and corr(Xj,Xj) = 1/2 if i and j are distinct elements of 
{1,---,P}\{4} 

Case 3: Xi, . . . , Xp are jointly Gaussian, marginally X(0, 1), and with corr(Xj, X5) = 

for all i ^ 5, corr(Xi, X4) = 1/^2 for all i i {4, 5}, and corr(Xi, Xj) = 1/2 if 

1 and j are distinct elements of {1, . . . ,p} \ {4, 5}. 



Case 1, with independent predictors, is the most straightforward for variable selection. 
In Cases 2 and 3, however, we have serial correlation such that corr(Xj,Xj) does not 
decay as \i — j\ increases. We will see later that for both Case 2 and Case 3 the 
true coefficients are chosen such that the response is marginally uncorrelated with 
X4. We therefore expect variable selection in these situations to be more challenging, 
especi ally for the non-ite rated versions of SIS. Notice that in the asymptotic theory of 
SIS in lFan and Lvl (120081 ) . this type of dependence is ruled out by their Condition (4). 



4.1 Logistic regression 



In this example, the data (x^, Yi), . . . , (x^, y„) are independent copies of a pair 
(x^, Y), where Y is distributed, conditional on X = x, as Bin(l,p(x)), with log(Y3^^] 
(3o + x^/3. We choose n = 400. 

The binary response of the logistic regression model is less informative than, say, 
the real-valued response in a linear model, which explains the larger sample size in 
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this example than in those that follow. It was also the reason for choosing d = 
LlT^n^J = 16 in both the vanilla version of SIS outlined in Section [2] (Van-SIS), and 
the second variant (Var2-SIS) in Section [3l2l For the first variant (Varl-SIS), however, 
we used d = I I = 66: note that since this means the selected variables are in the 

L log n -I ' 

intersection of two sets of size d, we typically end up with far fewer than d variables 
selected by this method. 

For the logistic regression example, the choice of final regularization parameter A for 
the SCAD penalty (after all (I) SIS steps) was made by means of an independent 
tuning data set of size n, rather than by cro ss-validation. Th i s also applies for the 



LASSO and Nearest Shrunken Centroid (NSC. lTibshirani et al.l . l2003l ) methods which 
we include for comparison; instead of using SIS, this method regularizes the log- 
likelihood with an Li-penalty. The reason for using the independent tuning data set 
is that the lack of information in the binary response means that cross-validation is 
particularly prone to overfitting in logistic regression. 

The coefficients used in each of the three cases were as follows: 



Case 1: (3o = 0, A = 1.2439, (32 = -1.3416, /^g = -1.3500, = -1.7971, (3^ = 
— 1.5810, Pq = —1.5967, and Pj = for j > 6. The corresponding Bayes test 
error is 0.1368. 

Case 2: po = 0, pi = 4, p2 = A, ps = 4, p4 = -6^2, and pj = for j > 4. The 
Bayes test error is 0.1074. 

Case 3: po = 0, pi = 4, p2 = 4, /^g = 4, p^ = -6^2, p^ = 4/3, and pj = for j > 5. 
The Bayes test error is 0.1040. 



In Case 1, the coefficients were chosen randomly, and were generated as (4 logn/ ^/n + 
\Z\/A)U with Z ~ A^(0, 1) and U = 1 with probability 0.5 and —1 with probability 
—0.5, independent of Z. For Cases 2 and 3, the choices ensure that even though 
P4 ^ 0, we have that X4 and Y are independent. The fact that X4 is marginally 
independent of the response is designed to make it difficult for a popular method 
such as the two-sample t test or other independent learning methods to recognize this 
important variable. Furthermore, for Case 3, we add another important variable X5 
with a small coefficient to make it even more difficult to identify the true model. For 
Case 2, the ideal variables picked up by the two sample test or independence screening 
technique are Xi, X2 and X3. Using these variables to build the ideal classifier, the 
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Bayes risk is 0.3443, which is much larger than the Bayes error 0.1074 of the true 
model with Xi, X2, X^, X4^. In fact one may exaggerate Case 2 to make the Bayes 
error using the independence screening technique close to 0.5, which corresponds to 
random guessing, by setting /3o = 0, Pi = P2 = Ps = o,, Pm = a ioi m = 5,6, ■ ■ ■ ,j, 
P4 = —a{j — 1)^/2/2, and /3m = for m > j. For example, the Bayes error using 
the independence screening technique, which deletes X4, is 0.4608 when j = 60 and 
a = 1 while the corresponding Bayes error using X^, m = 1, 2, ■ ■ ■ , 60 is 0.0977. 

In the tables below, we report several performance measures, all of which are based 
on 100 Monte Carlo repetitions. The first two rows give the median Li and squared 

L2 estimation errors ||/3-3||i = E?=o l/^j' -/^jl and ||/3-3lli = EU^Pj-Pj?- The 
third row gives the proportion of times that the (I) SIS procedure under consideration 
includes all of the important variables in the model, while the fourth reports the 
corresponding proportion of times that the final variables selected, after application 
of the SCAD or LASSO penalty as appropriate, include all of the important ones. 
The fifth row gives the median final number of variables selected. Measures of fit 
to the training data are provided in the sixth, seventh and eighth rows, na mely the 



medi an values of 2Q{Pq, (3), defined in ([T]), Akaike's information criterion (lAkaikd . 



1974J ). which adds twi ce the number of variables in the final model, and the Bayesian 



information criterion (jSchwarzl . 119781 ) . which adds the product of log n and the number 
of variables in the final model. Finally, an independent test data set of size lOOn was 
used to evaluate the median value of 2Q{Pq, f3) on the test data (Row 9), as well as 
to report the median 0-1 test error (Row 10), where we observe an error if the test 
response differs from the fitted response by more than 1/2. 

Table [1] compares five methods, Van-SIS, Varl-SIS, Var2-SIS, LASSO, and NSC. The 
most noticeable observation is that while the LASSO always includes all of the im- 
portant variables, it does so by selecting very large models - a median of 94 variables, 
as opposed to the correct number, 6, which is the median model size reported by all 
three SIS-based m etho ds. This is due to the bias of the LASSO, as pointed out by 



Fan and Lil (120011 ) and lZoul (120061 ) . which encourages the choice of a small regulariza- 
tion parameter to make the overall mean squared error small. Consequently, many 
unwanted variables are also recruited. Thus the LASSO method has large estimation 
error, and while 2Q{Pq, (3) is small on the training data set, this is a result of overfit, 
as seen by the large values of AIC/BIC, 2Q{Po,f3) on the test data and the 0-1 test 



error. 



As the predictors are independent in Case 1, it is unsurprising to see that Van-SIS 
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has the best performance of the three SIS-based methods. Even with the larger value 
of d used for Varl-SIS, it tends to miss important variables more often than the other 
methods. Although the method appears to have value as a means of obtaining a 
minimal set of variables that should be included in a final model, we will not consider 
Varl-SIS further in our simulation study. 



Table 1: Logistic regression, Case 1 





Van-SIS 


Varl-SIS 


Var2-SIS 


LASSO 


NSC 




1.1093 


1.2495 


1.2134 


8.4821 


N/A 


11/3 -3lli 


0.4861 


0.5237 


0.5204 


1.7029 


N/A 


Prop. incl. (I) SIS models 


0.99 


0.84 


0.91 


N/A 


N/A 


Prop. incl. final models 


0.99 


0.84 


0.91 


1.00 


0.34 


Median final model size 


6 


6 


6 


94 


3 


2Q0o3) (training) 


237.21 


247.00 


242.85 


163.64 


N/A 


AIC 


250.43 


259.87 


256.26 


352.54 


N/A 


BIG 


277.77 


284.90 


282.04 


724.70 


N/A 


2Q(/3o,3) (test) 


271.81 


273.08 


272.91 


318.52 


N/A 


0-1 test error 


0.1421 


0.1425 


0.1426 


0.1720 


0.3595 



In Cases 2 and 3, we also consider the iterated versions of Van-SIS and Var2-SIS, 
which we denote Van-ISIS and Var2-ISIS respectively. At each intermediate stage 
of the ISIS procedures, the Bayesian information criterion was used as a fast way of 
choosing the SCAD regularization parameter. 

From Tables [2] and [3|, we see that the non-iterated SIS methods fail badly in these 
awkward cases. Their performance is similar to that of the LASSO method. On the 
other hand, both of the iterated methods Van-ISIS and Var2-ISIS perform extremely 
well (and similarly to each other). 

4.2 Poisson regression 

In our second example, the generic response Y is distributed, conditional on X = x, 
as Poisson(yu(x)), where log /i(x) = /?o + 

Due to the extra information in the count response, we choose n = 200, and apply 
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Table 2: Logistic regression, Case 2 





Van QTC! 


V an-ioio 


Vdl z-oio 


V di Z-iOiO 




IN DO 




20.0504 


1.9445 


20.1100 


1.8450 


21.6437 


N/A 




9.4101 


1.0523 


9.3347 


0.9801 


9.1123 


N/A 


Prop. incl. (I)SIS models 


0.00 


1.00 


0.00 


1.00 


N/A 


N/A 


Prop. incl. final models 


0.00 


1.00 


0.00 


1.00 


0.00 


0.21 


Median final model size 


16 


4 


16 


4 


91 


16.5 


2Q(/3o,3) (training) 


307.15 


187.58 


309.63 


187.42 


127.05 


N/A 


AIC 


333.79 


195.58 


340.77 


195.58 


311.10 


N/A 


BIG 


386.07 


211.92 


402.79 


211.55 


672.34 


N/A 


2g(/3o,3) (test) 


344.25 


204.23 


335.21 


204.28 


258.65 


N/A 


0-1 test error 


0.1925 


0.1092 


0.1899 


0.1092 


0.1409 


0.3765 


Table 3: Logistic regression, Case 3 




Van-SIS 


Van-ISIS 


Var2-SIS 


Var2-ISIS 


LASSO 


NSC 


11/3 


20.5774 


2.6938 


20.6967 


3.2461 


23.1661 


N/A 


11/3 -3lli 


9.4568 


1.3615 


9.3821 


1.5852 


9.1057 


N/A 


Prop. incl. (I)SIS models 


0.00 


1.00 


0.00 


1.00 


N/A 


N/A 


Prop. incl. final models 


0.00 


0.90 


0.00 


0.98 


0.00 


0.17 


Median final model size 


16 


5 


16 


5 


101.5 


10 


2Q0o3) (training) 


269.20 


187.89 


296.18 


187.89 


109.32 


N/A 


AIC 


289.20 


197.59 


327.66 


198.65 


310.68 


N/A 


BIC 


337.05 


218.10 


389.17 


219.18 


713.78 


N/A 


2Q(/3o,3) (test) 


360.89 


225.15 


358.13 


226.25 


275.55 


N/A 


0-1 test error 


0.1933 


0.1120 


0.1946 


0.1119 


0.1461 


0.3866 



all versions of (I)SIS with d = [2!^^] ~ ^7. We also use 10-fold cross-validation to 
choose the final regularization parameter for the SCAD and LASSO penalties. The 
coefficients used were as follows: 

Case 1: po = 5, pi = -0.5423, p2 = 0.5314, ps = -0.5012, P4 = -0.4850, P5 = 
-0.4133, p6 = 0.5234, and pj = for j > 6. 
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Case 2: Pq = 5, pi = 0.6, p2 = 0.6, = 0.6, P^ = -0.9V2, and Pj = for j > 4. 

Case 3: po = 5, /?i = 0.6, p2 = 0.6, p^ = 0.6, /?4 = -0.9\/2, /Jg = 0.15, and pj = 
for j > 5. 

In Case 1, the magnitudes of the coefficients Pi, . . . , Pe were generated as (^^^ + 
\Z\/8)U with Z ~ A^(0, 1) and U = 1 with probabihty 0.5 and —1 with probabihty 
0.5, independently of Z. Again, the choices in Cases 2 and 3 ensure that, even though 
Pi 7^ 0, we have corr(X4, Y) = 0. 

The results are shown in Tables IH and O Even in Case 1, with independent 
predictors, the ISIS methods outperform SIS, so we chose not to present the results 
for SIS in the other two cases. Again, both Van-ISIS and Var2-ISIS perform extremely 
well, almost always including all the important variables in relatively small final 
models. The LASSO method continues to suffer from overfitting, particularly in the 
difficult Cases 2 and 3. 



Table 4: Poisson regression. Case 1 





Van-SIS 


Van-ISIS 


Var2-SIS 


Var2-ISIS 


LASSO 


11/3 


0.0695 


0.1239 


1.1773 


0.1222 


0.1969 


11/3 -3iii 


0.0225 


0.0320 


0.4775 


0.0330 


0.0537 


Prop, inch (I) SIS models 


0.76 


1.00 


0.45 


1.00 


N/A 


Prop, inch final models 


0.76 


1.00 


0.45 


1.00 


1.00 


Median final model size 


12 


18 


13 


17 


27 


2Q(/3o,3) (training) 


1560.85 


1501.80 


7735.51 


1510.38 


1534.19 


AIC 


1586.32 


1537.80 


7764.51 


1542.14 


1587.23 


BIC 


1627.06 


1597.17 


7812.34 


1595.30 


1674.49 


2Q(/3o,3) (test) 


1557.74 


1594.10 


14340.26 


1589.51 


1644.63 



4.3 Robust regression 

We have also conducted similar numerical experiments using Li-regression for the 
three cases in an analogous manner to the previous two examples. We obtain similar 
results. Both versions of ISIS are effective in selecting important variables with rel- 
atively low false positive rates. Hence, the prediction errors are also small. On the 
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Table 5: Poisson regression, Case 2 





"1 7" TOTO 

Van-IbIb 


Var2-Iblb 


T A OO 

LAbbO 


11/3 -/3||i 


0.2705 


0.2252 


3.0710 


\\/3-M 


0.0719 


0.0667 


1.2856 


Prop. incl. (I) SIS models 


1.00 


0.97 


N/A 


Prop. incl. final models 


1.00 


0.97 


0.00 


Median final model size 


18 


16 


174 


2g(/3o,3) (training) 


1494.53 


1509.40 


1369.96 


AIC 


1530.53 


1541.17 


1717.91 


BIG 


1589.90 


1595.74 


2293.29 


2Q(/3o,3) (test) 


1629.49 


1614.57 


2213.10 


Table 6: Poisson regression, Case 3 




Van-ISIS 


Var2-ISIS 


LASSO 


11/3 


0.2541 


0.2319 


3.0942 


ll/?-3lli 


0.0682 


0.0697 


1.2856 


Prop. incl. (I) SIS models 


0.97 


0.91 


0.00 


Prop. incl. final models 


0.97 


0.91 


0.00 


Median final model size 


18 


16 


174 


2g(/3o,3) (training) 


1500.03 


1516.14 


1366.63 


AIC 


1536.03 


1546.79 


1715.35 


BIC 


1595.40 


1600.17 


2293.60 


2Q(/3o,3) (test) 


1640.27 


1630.58 


2389.09 



other hand, LASSO missed the difficult variables in cases 2 and 3 and also selected 
models with a large number of variables to attenuate the bias of the variable selection 
procedure. As a result, its prediction errors are much larger. To save space, we omit 
the details of the results. 
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4.4 Linear regression 



Note that our new ISIS procedure allows variable deletion in e a ch st ep. It is an 
important improvement over the original proposal of iFan and Lvl (120081 ) even in the 
ordinary least-squares setting. To demonstrate this, we choose Case 3, the most 
difficult one, with coefficients given as follows. 



Case 3: jSo 
J > 5. 



0, pi = 5, P2 = 5, p3 = 5, /34 = -15v^/2, = 1, and pj 



for 



The response Y is set as Y = with in dependent e ~ A^(0, 1). This model is the 

same as Example 4.2.3 of lFan and Lvl (120081 ). Using n = 70 and d = n/2, our new ISIS 
method includes all five important variables for 91 out of the 100 repetitions, while 
the original ISIS without variable deletion includes all the important variables for only 
36 out of the 100 repetitions. The median model size of our new variable selection 
procedure with variab le deletion is 21, w' hereas the median model size corresponding 
to the original ISIS of iFan and La] (l2008h is 19. 



We have also conducted the numerical experiment with a different sample size n = 100 
and d = n/2 = 50. For 97 out of 100 repetitions, our new ISIS includes all the 
important variables while ISIS without variable deletion includes all the important 
variables for only 72 repetitions. Their median model sizes are both 26. This clearly 
demonstrates the improvement of allowing variable deletion in this example. 



4.5 Multicategory classification 

Our final example in this section is a four-class classification problem. Here we study 
two different predictor configurations, both of which depend on first generating in- 
dependent Xi, . . . ,Xp such that Xi, . . . , X4 are uniformly distributed on [— -\/3, \/3], 
and X5, . . . ,Xp are distributed as N{0, 1). We use these random variables to generate 
the following cases: 

Case 1: Xj = Xj for j = 1, . . . ,p 

Case 2: Xi = Xi - V2X5, X2 = X2 + V2X5, X3 = X3 - V2X5, X4 = X4 + ^2X5, 
and Xj = \/3Xj for j = 5, . . . , p. 
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Conditional on X = x, the response Y was generated according to P{Y = A;|X = 
x) oc exp{/fc(x)}, for k = 1, ... ,4, where /i(x) = —axi + 0x4, /2(x) = axi — ax2, 
/3(x) = ax2 — axs and /4(x) = ax^ — ax^ with a = b/Vs. 

In both Case 1 and Case 2, all predictors have the same standard deviation since 
sd(Xj) = 1 for j = 1, 2, ■ ■ ■ ,p in Case 1 and sd(Xj) = ^/S for j = 1,2, - ■ ■ ,p in 
Case 2. Moreover, for this case, the variable X5 is marginally unimportant, but jointly 
significant, so it represents a challenge to identify this as an important variable. For 
both Case 1 and Case 2, the Bayes error is 0.1373. 



For th e multicategory classification we use the loss function proposed by iLee et al. 



(I2OO4J ). Denote the coefficients for the kth class by (3ok and f3,^ for k = 1,2, 3, 4, and 
let B = ((/?oi,/3r)^,(/9o2,/3D^,(/9o3,/3D^,(/9o4,/3l)^). Let /,(x) ^ /,(x, /3o,, = 
Pok + x^/3fc, k = 1,2, 3, 4, and 

f(x) ^f(x,B) = (/i(x),/2(x),/3(x),/4(x)f . 

The loss function is given by L(Y, f(x)) = Xl^yy [1 + + ) where [ip]+ = if t/' > 

and otherwise. Deviating slightly from our standard procedure, the marginal utility 
of the j-feature is defined by 



n 4 

L,- = min m, f(^.. , B)) + -Y,P% 



2 

k=l 



to avoid possible unidentifiablity issues due to the hinge loss function. Analogous 
modification is applied to ([5]) in the iterative feature selection step. With estimated 
coefficients Pok and /J^, and /fc(x) = /3ofc + yii'^ (3^ for k = 1,2,3,4, the estimated 
classification rule is given by argmaXj;./fc(x). There a re some other ap propriate multi- 



category loss functions such as the one proposed by iLiu et al.l (120051 ) 



As with the logistic regression example in Section l^?Tl we use n = 400, d = [ ^^"^^ J = 
16 and an independent tuning data set of size n to pick the final regularization 
parameter for the SCAD penalty. 

The results are given in Table The mean estimated testing error was based on 
a further testing data set of size 200n, and we also report the standard error of 
this mean estimate. In the case of independent predictors, all (I) SIS methods have 
similar performance. The benefits of using iterated versions of the ISIS methodology 
are again clear for Case 2, with dependent predictors. 
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Table 7: Multicategory classification 





Van-SIS 


Van-ISIS 


Var2-SIS 


Var2-ISIS 


LASSO 


NSC 




Case 1 


Prop. incl. (I) SIS models 


1.00 


1.00 


0.99 


1.00 


N/A 


N/A 


Prop. incl. final model 


1.00 


1.00 


0.99 


1.00 


0.00 


0.68 


Median modal size 


2.5 


4 


10 


5 


19 


4 


0-1 test error 


0.3060 


0.3010 


0.2968 


0.2924 


0.3296 


0.4524 


Test error standard error 


0.0067 


0.0063 


0.0067 


0.0061 


0.0078 


0.0214 




Case 2 


Prop. incl. (I) SIS models 


0.10 


1.00 


0.03 


1.00 


N/A 


N/A 


Prop. incl. final models 


0.10 


1.00 


0.03 


1.00 


0.33 


0.30 


Median modal size 


4 


11 


5 


9 


54 


9 


0-1 test error 


0.4362 


0.3037 


0.4801 


0.2983 


0.4296 


0.6242 


Test error standard error 


0.0073 


0.0065 


0.0083 


0.0063 


0.0043 


0.0084 



5 Real data examples 

In this section, we apply our proposed methods to two real data sets. The first one has 
a binary response while the second is multi-category. We treat both as classification 
problems and use the hinge loss discussed in Section 14. 5[ We compare our methods 
with two alternatives: the LASSO and NSC. 



5.1 Neuroblastoma data 



We first consider the neuroblastoma data used in lOberthuer et al.l (l2006l ). The study 
consists of 251 patients of the German Neuroblastoma Trials NB90-NB2004, diag- 
nosed between 1989 and 2004. At diagnosis, patients' ages range from to 296 
months with a median age of 15 months. They analyzed 251 neuroblastoma spec- 
imens using a customized oligonucleotide microarray with the goal of developing a 
gene expression-based classification rule for neuroblastoma patients to reliably predict 
courses of the disease. This also provides a comprehensive view on which set of genes 
is responsible for neuroblastoma. 
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The complete data set, obtained via the MicroArray Quahty Control phase-II (MAQC- 
II) project, includes gene expression over 10,707 probe sites. Of particular interest is 
to predict the response labeled "3-year event-free survival" (3-year EFS) which is a 
binary variable indicating whether each patient survived 3 years after the diagnosis of 
neuroblastoma. Excluding five outlier arrays, there are 246 subjects out of which 239 
subjects have 3-year EFS information available with 49 positives and 190 negatives. 
We apply SIS and ISIS to reduce dimensionality from p = 10, 707 to d = 50. On 
the other hand, our competitive methods LASSO and NSC are applied directly to 
p = 10, 707 genes. Whenever appropriate, five-fold cross validation is used to select 
tuning parameters. We randomly select 125 subjects (25 positives and 100 negatives) 
to be the training set and the remainder are used as the testing set. Results are re- 
ported in the top half of Table [3 Selected probes for LASSO and all different (I)SIS 
methods are reported in Table [91 

In MAQC-II, a specially designed end point is the gender of each subject, which 
should be an easy classification. The goal of this specially designed end point is to 
compare the performance of different classifiers for simple classification jobs. The 
gender information is available for all the non-outlier 246 arrays with 145 males and 
101 females. We randomly select 70 males and 50 females to be in the training set 
and use the others as the testing set. We set d = 50 for our SIS and ISIS as in the 
case of the 3-year EFS end point. The results are given in the bottom half of Table 
m Selected probes for all different methods are reported in Table [TOl 



Table 8: Results from analyzing two endpoints of the neuroblastoma data 



End point 




SIS 


ISIS 


var2-SIS 


var2-ISIS 


LASSO 


NSC 


3-year EFS 


No. of predictors 
Testing error 


5 

19/114 


23 
22/114 


10 
22/114 


12 
21/114 


57 
22/114 


9413 
24/114 


Gender 


No. of predictors 
Testing error 


6 

4/126 


2 

4/126 


4 

4/126 


2 

4/126 


42 
5/126 


3 

4/126 



We can see from Table [8] that our (I) SIS methods compare favorably with the LASSO 
and NSC. Especially for the end point 3-year EFS, our methods use fewer predictors 
while giving smaller testing error. For the end point GENDER, Table [10] indicates 
that the most parsimonious model given by ISIS and Var2-ISIS is a sub model of 
others. 
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Table 9: 


O 1 X 1 1 

bciected probes 


lor tnc 


Probe 


SIS ISIS 


var2-SIS var2-ISIS 


LASSO 


'A_23_P160638' 






X 


'A_23_P168916' 


X 




X 


'A_23_P42882' 


X 






'A_23_P145669' 






X 


'A_32_P50522' 






X 


■A_23_P34800' 






X 


'AJ23_P86774' 


X 






'AJ23-P417918' 




X 


X 


'A_23_P100711' 






X 


'A.23.P145569' 






X 


'A.23.P337201' 






X 


'A_23_P 56630' 


X 


X 


X 


'AJ23_P208030' 








'A.23_P211738' 


X 






'A.23_P153692' 






X 


'A_24_P148811' 




X 




'A_23_P126844' 




X 


X 


'AJ23_P26194' 








'AJ24_P399174' 






X 


'A_24_P183664' 






X 


'A_23_P59051' 




X 




'A_24_P14464' 






X 


'A_23_P501831' 


X 


X 




■A_23_P103631' 




X 




'A_23_P32558' 




X 




'AJ23.P26873' 


X 






'AJ23.P95553' 






X 


'A.24_P227230' 


X 




X 


'A.23_P5131' 






X 


'A_23_P218841' 








'AJ23_P58036' 






X 


'A.23_P89910' 


X 






'A.24_P98783' 






X 


'A_23_P121987' 








'A_32_P365452' 






X 


'A_23_Pl()Hfi82' 


X 






•Hs58251.2' 




X 




'A_23_P121102' 


X 






'AJ23_P3242' 






X 


'A_32_P177667' 






X 


'Hs6806.2' 






X 


'Hs376840.2' 






X 


'A_24_P136691' 






X 


'Pro2BG J36.D.7' 


X 


X 




'A.23.P87401' 




X 




'A_32_P302472' 






X 


'Hs34302(i.l' 




X 




'A_23_P216225' 


X 


X 


X 


'AJ23_P203419' 








'AJ24_P22163' 


X 




X 


'A_24_P187706' 






X 


'Cl.QC 








•Hsl90380.1' 


X 




X 


•Hsll7120.1' 




X 




■A_32_P133518' 






X 


'BQCPlJ'ro26G.T5' 






X 


'AJ24_P111061' 




X 




'AJ23_P20823' 


X X 


X 


X 


'AJ24_P211151' 




X 




'Hs265827.1' 


X 




X 


'Pro25G_B12_D_7' 






X 


'Hsl56406.1' 






X 


'A.24_P902B09' 




X 




'A.32_P32653' 






X 


•Hs42896.1' 


X 






'A^2_P143793' 


X 


X 


X 


'AJ23.P391382' 






X 


'AJ23.P327134' 






X 


'Pro2BG_EQCPl.T5' 






X 


'AJ24_P361461' 




X 




'Hsl70298.1' 






X 


•A.23_P159390' 






X 


'Hs272191.1' 


X 






'r60j,135' 






X 


'Hs439489.1' 






X 


'A.23.P107296' 






X 


'A.23_P100764' 


X X 


X X 


X 


•A.23_P157()27' 


X 




27. 


'A.24_P342055' 






'AJ23.P1387' 


X 






'Hs6911.1' 






X 


'r60_l' 






X 



Table 10: Selected probe for Gender end point 

Probe SIS ISIS var2-SIS var2-ISIS LASSO NSC 

'A_23.P201035' x 

'A_24_P167642' x 

'A_24_P55295' x 

'A_24_P82200' x 

'A_23.P109614' X 

'A_24_P102053' x 

'A_23.P170551' X 
'A_23.P329835' x 

'AJ3.P70571' X 

'A^3.P259901' X 

'A^4.P222000' X 

'A^3.P160729' X 

'A_23.P95553' x x 

'A_23.P100315' X 

'AJ23.P10172' X 

'A^3.P137361' X 

'A_23_P202484' x 

'A_24_P56240' x 

'A_32_P104448' x 

'{-)3xSLvl' X 

'A_24_P648880' x 

'Hs446389.2' x 

'A_23.P259314' x x x x x x 

'Hs386420.1' x 

'Pro25G.B32_D_7' x 

'Hsll6364.2' X 

'A_32_P375286' x x 

'A_32_P152400' x 

'A_32_P105073' x 

'Hsl47756.1' X 

'Hsll0039.1' X 

'r60.al07' x 

'Hs439208.1' x 

'A^2.P506090' X 

'A_24_P706312' x 

'Hs58042.1' X 

'A_23.P128706' x 

'Hs3569.1' X 

'A_24_P182900' x 

'A_23.P92042' x 

'Hsl70499.1' X 

'A_24_P500584' x x x x x x 

'A_32_P843590' x 

'Hs353080.1' X 

'A_23.P388200' x 

'Cl.QC X 

'Hs452821.1' X 
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5.2 SRBCT data 



In this sect i on, w e apply our ni e thod to the children cancer data set reported in 



Khan et al.l (120011 ). iKhan et al.l (120011 ) used artificial neural networks to develop 
a method of classifying the small, round blue cell tumors (SRBCTs) of childhood 
to one of the four categories: neuroblastoma (NB), rhabdomyosarcoma (RMS), non- 
Hodgkin lymphoma (NHL), and the Ewing family of tumors (EWS) using cDNA gene 
expression profiles. Accurate diagnosis of SRBCTs to these four distinct diagnostic 
categories is important in that the treatment options and responses to therapy are 
different from one category to another. 

After filtering, 2308 gene profiles out of 6567 genes are given in the SRBCT data 
set. It is available online at http://research.nhgri.nih.gov/microarray/Suppleiiient7] 
It includes a training set of size 63 (12 NBs, 20 RMSs, 8 NHLs, and 23 EWS) and an 
independent test set of size 20 (6 NBs, 5 RMSs, 3 NHLs, and 6 EWS). 

Before performing classification, we standardize the data sets by applying a simple 
linear transformation to both the training set and the test set. The linear trans- 
formation is based on the training data so that, after standardizing, the training 
data have mean zero and standard deviation one. Our (I) SIS reduces dimensionality 
from p = 2308 to d = [63/log63j = 15 first while alternative methods LASSO and 
NSC are applied to p = 2308 gene directly. Whenever appropriate, a four-fold cross 
validation is used to select tuning parameters. 

ISIS, var2-ISIS, LASSO and NSC all achieve zero test error on the 20 samples in 
the test set. NSC uses 343 genes and LASSO requires 71 genes. However ISIS and 
var2-ISIS use 15 and 14 genes, respectively. 

This real data application delivers the same message that our new ISIS and var2- 
ISIS methods can achieve competitive classification performance using fewer predictor 
variables. 
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